Lab 2 - EDA and Feature Engineering

SOLUTIONS

Packages

Here are some of the packages we’ll use in this lab.

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(tidyverse, magrittr, gt, gtExtras, tidymodels, DataExplorer)
Warning: package 'gt' was built under R version 4.3.3
Warning: package 'modeldata' was built under R version 4.3.3
# library(magrittr)     # the pipe
# library(tidyverse)    # for data wrangling + visualization
# library(tidymodels)   # for modeling
# library(gt)           # for making display tables
# library(gtExtras)     # helper functions for beautiful tables
# library(DataExplorer) # for automated EDA

Data: The Tate Collection

Tate is an institution that houses the United Kingdom’s national collection of British art, and international modern and contemporary art. It is a network of four art museums: Tate Britain, London (until 2000 known as the Tate Gallery, founded 1897), Tate Liverpool (founded 1988), Tate St Ives, Cornwall (founded 1993) and Tate Modern, London (founded 2000), with a complementary website, Tate Online (created 1998). Tate is not a government institution, but its main sponsor is the UK Department for Culture, Media and Sport.

The dataset used here contains the metadata for around 70,000 artworks that Tate owns or jointly owns with the National Galleries of Scotland as part of ARTIST ROOMS. Metadata for around 3,500 associated artists is also included.

The metadata here is released under the Creative Commons Public Domain CC0 licence. Images are not included and are not part of the dataset.

This dataset contains the following information for each artwork:

id                  acquisitionYear
accession_number    dimensions
artist              width
artistRole          height
artistId            depth
title               units
dateText            inscription
medium              thumbnailCopyright
creditLine          thumbnailUrl
year                url

Use the code below to load the Tate Collection data sets, and note the names of the variables referencing each dataset.

the_tate <- 
  readr::read_delim(
    "../data/the-tate-collection.csv"
    , ";"
    , escape_double = FALSE
    , trim_ws = TRUE
  )
the_tate_artists <- readr::read_csv("../data/the-tate-artists.csv")

Exercises

Exercise 1

First of all, let’s analyze the entire dataset as it is. We have 69201 observations, each corresponding to an artwork in the Tate collection. For each observation/artwork we have 20 attributes, including artist, title, date, medium, dimensions, and Tate’s acquisition year. Generate some general observations about the dataset using dplyr::summarize, including the number of unique artists represented in the collection, the period represented in the collection, and the acquisition period over which the collection was created.

Next, use DataExplorer::introduce() and DataExplorer::plot_missing() to examine the scope of the missing data.

SOLUTION:
the_tate %>% dim()
[1] 69201    20
tts <- the_tate %>% 
  dplyr::summarize(
    num_artists = length(unique(artist))
    , period_start = min(year, na.rm = T)
    , period_end = max(year, na.rm = T)
    , acquisition_start = min(acquisitionYear, na.rm = T)
    , acquisition_end = max(acquisitionYear, na.rm = T)
  ) %T>%
  (\(x) print(x))
# A tibble: 1 × 5
  num_artists period_start period_end acquisition_start acquisition_end
        <int>        <dbl>      <dbl>             <dbl>           <dbl>
1        3336         1545       2012              1823            2013
stringr::str_glue(
  "The works of {tts$num_artists} artists, created between {tts$period_start}-{tts$period_end}, were acquired by the Tate from {tts$acquisition_start} to {tts$acquisition_end}")
The works of 3336 artists, created between 1545-2012, were acquired by the Tate from 1823 to 2013
the_tate %>% DataExplorer::introduce() %>% dplyr::glimpse()
the_tate %>% DataExplorer::plot_missing()
Rows: 1
Columns: 9
$ rows                 <int> 69201
$ columns              <int> 20
$ discrete_columns     <int> 13
$ continuous_columns   <int> 7
$ all_missing_columns  <int> 0
$ total_missing_values <int> 219798
$ complete_rows        <int> 155
$ total_observations   <int> 1384020
$ memory_usage         <dbl> 43813752

Exercise 2

Roughly 7.8% of the works in the collection have missing dates. How many works have missing dates (i.e., what is the number)?

Use the table() function to count the number of works with missing dates for each artist. Convert the table into a tibble using tibble::as_tibble(), and then sort the count in descending order.

How many artists have works with missing dates?

Mutate the resulting table, adding a column for the percent of the total missing data attributable to each artist, and another for the cumulative percent (just apply cumsum() to the percentages).

If we could identify all the missing dates for each artist, what is the smallest number of artists needed to resolve at least 50% of the missing year data?

Is this missing data MCAR, MAR, or MNAR?

SOLUTION:
missing_year <- the_tate %>% 
  dplyr::filter(is.na(year))  # select the rows where the year value is missing
missing_year %>% dim()
[1] 5397   20
missing_artist_tbl <- 
  missing_year$artist %>%         # take the artist column from the table of missing years
  table() %>%                     # make a table of counts for each artist 
  tibble::as_tibble() %>%         # convert it to a tibble / data.frame
  dplyr::rename(artist = 1) %>%   # rename the first column
  arrange(desc(n)) %>%            # arrange in descending order by count
  mutate(                         # add or update columns
    total = sum(n)                # create a temporary column: sum of all counts
    , pct_of_missing = n/total    # calculate the % missing for each artist
    , cum_pct = 
      cumsum(pct_of_missing)      # calculate the cumulative % missing
  ) %T>% 
  (\(x) print(dim(x))) %>% 
  dplyr::select(-total)           # drop the temporary column
[1] 461   5
missing_artist_tbl %>% 
  dplyr::filter(cum_pct <= 0.51) %>% 
  print(n=100)
# A tibble: 11 × 4
   artist                             n pct_of_missing cum_pct
   <chr>                          <int>          <dbl>   <dbl>
 1 Jones, George                   1039         0.193    0.193
 2 Turner, Joseph Mallord William   343         0.0636   0.256
 3 British (?) School               325         0.0602   0.316
 4 Cozens, Alexander                209         0.0387   0.355
 5 Dance-Holland, Sir Nathaniel     163         0.0302   0.385
 6 Stothard, Thomas                 158         0.0293   0.414
 7 Flaxman, John                    136         0.0252   0.440
 8 Barlow, Francis                  106         0.0196   0.459
 9 Davis, John Scarlett              83         0.0154   0.475
10 Hunt, William Henry               78         0.0145   0.489
11 Callcott, Sir Augustus Wall       75         0.0139   0.503

There are 5397 works with missing dates, 461 artists whose works have missing dates, and the works of just 11 artists account for about 50% of the missing dates.

Since most of the missing year data is associated with a handful of artists, the missingness depends on an observed variable (the artist) rather than on the unobserved year itself, so the missing data would be classified as MAR.

Exercise 3

Prepare a table showing the number of works for each unique artist, ordered from the largest number of works to the smallest. Show the top 10 artists by number of works in the collection.

SOLUTION:
tate_artists_tbl <- 
  the_tate$artist %>%             # take the artist column from the complete dataset
  table() %>%                     # make a table of counts for each artist 
  tibble::as_tibble() %>%         # convert it to a tibble / data.frame
  dplyr::rename(artist = 1) %>%   # rename the first column
  arrange(desc(n))                # arrange in descending order by count

tate_artists_tbl
# A tibble: 3,336 × 2
   artist                             n
   <chr>                          <int>
 1 Turner, Joseph Mallord William 39389
 2 Jones, George                   1046
 3 Moore, Henry, OM, CH             623
 4 Daniell, William                 612
 5 Beuys, Joseph                    578
 6 British (?) School               388
 7 Paolozzi, Sir Eduardo            385
 8 Flaxman, John                    287
 9 Phillips, Esq Tom                274
10 Warhol, Andy                     272
# ℹ 3,326 more rows

Exercise 4

Modify the table from the last exercise to show the percentage of the total collection that each artist represents. Format the table using gt::gt with the percentage column formatted for display as a percentage, to two decimals. Apply a theme from the gtExtras package to the formatted table.

SOLUTION:
tate_artists_tbl %>% 
  dplyr::mutate(
    pct_of_collection = n/sum(n, na.rm = T)
  ) %>% 
  dplyr::slice_head(n=10) %>% 
  gt::gt('artist') %>% 
  gt::fmt_percent(columns = pct_of_collection) %>% 
  gtExtras::gt_theme_538()
Table has no assigned ID, using random ID 'kvwuelljdd' to apply `gt::opt_css()`
Avoid this message by assigning an ID: `gt(id = '')` or `gt_theme_538(quiet = TRUE)`
n pct_of_collection
Turner, Joseph Mallord William 39389 56.92%
Jones, George 1046 1.51%
Moore, Henry, OM, CH 623 0.90%
Daniell, William 612 0.88%
Beuys, Joseph 578 0.84%
British (?) School 388 0.56%
Paolozzi, Sir Eduardo 385 0.56%
Flaxman, John 287 0.41%
Phillips, Esq Tom 274 0.40%
Warhol, Andy 272 0.39%

Exercise 5

Using the tibble the_tate, select the columns for artist and title and count the number of rows.

Next take the tibble the_tate, select the columns for artist and title, and then apply dplyr::distinct. Count the distinct artist-title pairs.

How many are duplicated?

SOLUTION:
all_data_dim <- 
the_tate %>% 
  dplyr::select(artist, title) %T>% 
  (\(x) print(dim(x))) %>% 
  dim()
[1] 69201     2
no_dups_data_dim <- 
  the_tate %>% 
  dplyr::select(artist, title) %>% 
  dplyr::distinct(artist, title) %T>% 
  (\(x) print(dim(x))) %>% 
  dim()
[1] 45496     2
stringr::str_glue(
  "The full dataset has {all_data_dim[1]} rows and after removing duplicates we have {no_dups_data_dim[1]} rows, so there are {all_data_dim[1] - no_dups_data_dim[1]} duplicate rows."
)
The full dataset has 69201 rows and after removing duplicates we have 45496 rows, so there are 23705 duplicate rows.

Exercise 6

Similar to exercises 2 and 3, in this exercise take the raw data (the_tate) and add a column with the area of each artwork in \(\text{cm}^2\). Next select the artist, title and the area and remove NA values using tidyr::drop_na, then order the works by area. Use dplyr::slice_head and dplyr::slice_tail to find the largest and smallest artworks in the collection.

SOLUTION:
artwork_area_tbl <- 
  the_tate %>% 
  dplyr::mutate(                         # from the units column we know that
    area = width * height / 100          # width and height are in mm; dividing
  ) %>%                                  # each by 10 gives cm, so w*h/100 is cm^2
  dplyr::select(artist, title, area) %>% 
  tidyr::drop_na() %>%                   # drop all rows with any missing values
  dplyr::arrange(desc(area))             # sort from largest to smallest

artwork_area_tbl %>% 
  dplyr::slice_head(n=1)                 # select the first row from the top
artwork_area_tbl %>% 
  dplyr::slice_tail(n=1)                 # select the first row from the bottom
# A tibble: 1 × 3
  artist           title                               area
  <chr>            <chr>                              <dbl>
1 Therrien, Robert No Title (Table and Four Chairs) 1324620
# A tibble: 1 × 3
  artist         title            area
  <chr>          <chr>           <dbl>
1 Mesens, E.L.T. Thème de Ballet  2.37
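As a sanity check on the unit conversion (a minimal sketch with made-up dimensions): the units column records mm, so dividing each dimension by 10 gives cm, and the area in \(\text{cm}^2\) is width × height / 100.

```r
# Hypothetical 200 mm x 300 mm artwork (i.e. 20 cm x 30 cm)
w_mm <- 200
h_mm <- 300

# (w/10) * (h/10) = w * h / 100, so the result below is in cm^2
area_cm2 <- w_mm * h_mm / 100
area_cm2   # 600 cm^2
```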

Exercise 7

Join the tables the_tate and the_tate_artists using dplyr::left_join, assigning the result to the variable the_tate. Drop rows with NA gender values and then group by gender. Show the resulting table.

SOLUTION:
# the_tate_artists <- readr::read_csv("labs/data/the-tate-artists.csv")
the_tate %<>% 
  dplyr::left_join(
    the_tate_artists
    , by = c("artistId" = "id")
  )
summary(the_tate)
       id         accession_number      artist           artistRole       
 Min.   :     3   Length:69201       Length:69201       Length:69201      
 1st Qu.: 19096   Class :character   Class :character   Class :character  
 Median : 37339   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 39148                                                           
 3rd Qu.: 54712                                                           
 Max.   :129068                                                           
                                                                          
    artistId        title             dateText            medium         
 Min.   :    0   Length:69201       Length:69201       Length:69201      
 1st Qu.:  558   Class :character   Class :character   Class :character  
 Median :  558   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 1201                                                           
 3rd Qu.: 1137                                                           
 Max.   :19232                                                           
                                                                         
  creditLine             year      acquisitionYear  dimensions       
 Length:69201       Min.   :1545   Min.   :1823    Length:69201      
 Class :character   1st Qu.:1817   1st Qu.:1856    Class :character  
 Mode  :character   Median :1831   Median :1856    Mode  :character  
                    Mean   :1867   Mean   :1911                      
                    3rd Qu.:1953   3rd Qu.:1982                      
                    Max.   :2012   Max.   :2013                      
                    NA's   :5397   NA's   :45                        
     width             height            depth             units          
 Min.   :    3.0   Min.   :    6.0   Min.   :    1.00   Length:69201      
 1st Qu.:  118.0   1st Qu.:  117.0   1st Qu.:   48.25   Class :character  
 Median :  175.0   Median :  190.0   Median :  190.00   Mode  :character  
 Mean   :  323.5   Mean   :  346.4   Mean   :  479.20                     
 3rd Qu.:  345.0   3rd Qu.:  359.0   3rd Qu.:  450.00                     
 Max.   :11960.0   Max.   :37500.0   Max.   :18290.00                     
 NA's   :3367      NA's   :3342      NA's   :66687                        
 inscription        thumbnailCopyright thumbnailUrl          url.x          
 Length:69201       Length:69201       Length:69201       Length:69201      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
     name              gender             dates            yearOfBirth  
 Length:69201       Length:69201       Length:69201       Min.   :1500  
 Class :character   Class :character   Class :character   1st Qu.:1775  
 Mode  :character   Mode  :character   Mode  :character   Median :1775  
                                                          Mean   :1818  
                                                          3rd Qu.:1894  
                                                          Max.   :2004  
                                                          NA's   :544   
  yearOfDeath   placeOfBirth       placeOfDeath          url.y          
 Min.   :1570   Length:69201       Length:69201       Length:69201      
 1st Qu.:1851   Class :character   Class :character   Class :character  
 Median :1851   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1875                                                           
 3rd Qu.:1864                                                           
 Max.   :2014                                                           
 NA's   :8774                                                           
tate_gender_tbl <- the_tate %>% 
  tidyr::drop_na(gender) %>%
  dplyr::group_by(gender) %T>% 
  (\(x) print(x)) 
# A tibble: 68,501 × 28
# Groups:   gender [2]
      id accession_number artist       artistRole artistId title dateText medium
   <dbl> <chr>            <chr>        <chr>         <dbl> <chr> <chr>    <chr> 
 1 20400 P77527           Charlton, A… artist          891 [no … 1991     Scree…
 2 20618 P77580           Artschwager… artist          669 Inte… 1972     Scree…
 3 20830 P77612           Marden, Bri… artist         1578 [no … 1971     Etchi…
 4 21086 P77680           Francis, Ma… artist         2311 Unti… 1994     Monot…
 5 21163 P77699           Self, Colin  artist         1922 Powe… 1968     Scree…
 6 21157 P77704           Woodrow, Bi… artist         2170 [ind… 1994     Linoc…
 7 21153 P77708           Woodrow, Bi… artist         2170 Iron  1994     Linoc…
 8 21210 P77731           Sherman, Ci… artist         1938 Unti… 1982     Photo…
 9 21271 P77738           Wilding, Al… artist         2146 [no … 1994     Etchi…
10 21405 P77748           Struth, Tho… artist         2339 Vico… 1988     Photo…
# ℹ 68,491 more rows
# ℹ 20 more variables: creditLine <chr>, year <dbl>, acquisitionYear <dbl>,
#   dimensions <chr>, width <dbl>, height <dbl>, depth <dbl>, units <chr>,
#   inscription <chr>, thumbnailCopyright <chr>, thumbnailUrl <chr>,
#   url.x <chr>, name <chr>, gender <chr>, dates <chr>, yearOfBirth <dbl>,
#   yearOfDeath <dbl>, placeOfBirth <chr>, placeOfDeath <chr>, url.y <chr>
# NOT PART OF THE LAB
# For this problem assume I don't know how many gender labels 
# are in the dataset. To proceed, I make a tibble of the labels
# then I add a column for the counts. Finally, I calculate the
# counts by mapping a function against the labels in the tibble.

# I use map_int because I know I want an integer result; otherwise
# purrr::map() will return a nested list column
tibble::tibble( 
  labels = unique(tate_gender_tbl$gender)            
) %>% 
  dplyr::mutate(
    count =
      purrr::map_int(
        labels
        , (\(x) sum( tate_gender_tbl$gender == x ))
      )
  )
# A tibble: 2 × 2
  labels count
  <chr>  <int>
1 Male   65774
2 Female  2727

Exercise 8

In the next two exercises we switch to a different dataset, the historical price data for the S&P 500 Index.

Read the historical price data in the file SPX_HistoricalData_1692322132002.csv using readr::read_csv and add a column for the year of the transaction and the daily return \(r_d\), using the formula

\[ r_d\equiv \log \frac{\text{Close/Last}_{t=i}}{\text{Close/Last}_{t=i-1}} \]

You will likely need the dplyr::lead or dplyr::lag function. Add an additional column for the daily return variance \(\text{var}_d = r_d^2\).

Finally, group by year and use dplyr::summarize to compute the annual returns and standard deviations. Add the argument .groups = "drop" to the dplyr::summarize call to drop the grouping after the summary is created.

SOLUTION:
# read the dataset 
spx_data <- 
  readr::read_csv(
    "../data/SPX_HistoricalData_1692322132002.csv"
    , show_col_types = FALSE
  )
# NOT PART OF THE LAB
# The gtExtras package has a nice function for EDA.
# Give it a try!
# spx_data %>% gtExtras::gt_plt_summary("S&P 500 data")
spx_data %<>% 
  dplyr::mutate(
    Date = lubridate::mdy(Date)                   # Date is a character string in the data,
                                                  # so it needs to be converted to a date
    , year = lubridate::year(Date)                # extract the year from the date
    , return = 
      log(`Close/Last`/dplyr::lead(`Close/Last`)) # calculate the return
                                                  # verify whether to use lead or lag by hand
    , var = return^2                              # calculate the variance
  ) %T>% 
  (\(x) print(x))
# A tibble: 1,257 × 9
   Date       `Close/Last` Volume  Open  High   Low  year    return          var
   <date>            <dbl> <chr>  <dbl> <dbl> <dbl> <dbl>     <dbl>        <dbl>
 1 2023-08-16        4404. --     4434. 4450. 4404.  2023 -0.00758  0.0000575   
 2 2023-08-15        4438. --     4479. 4479. 4432.  2023 -0.0116   0.000135    
 3 2023-08-14        4490. --     4458. 4490. 4453.  2023  0.00573  0.0000329   
 4 2023-08-11        4464. --     4451. 4476. 4444.  2023 -0.00107  0.00000115  
 5 2023-08-10        4469. --     4487. 4527. 4458.  2023  0.000251 0.0000000628
 6 2023-08-09        4468. --     4502. 4502. 4461.  2023 -0.00706  0.0000499   
 7 2023-08-08        4499. --     4498. 4503. 4464.  2023 -0.00423  0.0000179   
 8 2023-08-07        4518. --     4492. 4520. 4491.  2023  0.00898  0.0000807   
 9 2023-08-04        4478. --     4514. 4540. 4475.  2023 -0.00531  0.0000282   
10 2023-08-03        4502. --     4494. 4519. 4486.  2023 -0.00255  0.00000651  
# ℹ 1,247 more rows
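Because the rows in the file run from most recent to oldest, the previous trading day's close sits one row below, which is why the solution uses dplyr::lead rather than dplyr::lag. A minimal check with a toy series (made-up prices):

```r
library(dplyr)

# Toy closes in reverse chronological order, like the SPX file
px <- tibble(
  day   = c(3, 2, 1),
  close = c(110, 105, 100)
)

px <- px %>%
  mutate(return = log(close / lead(close)))  # lead() looks down the column,
                                             # i.e. at the previous trading day

px$return   # log(110/105), log(105/100), NA for the oldest row
```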
spx_return_tbl <- 
  spx_data %>% 
  dplyr::group_by(year) %>% 
  dplyr::summarize(
    return = exp( sum(return, na.rm = TRUE) ) - 1    # the annual return is the 
                                                     # exponential of the sum of the 
                                                     # log daily returns, less 1
    , volatility = sum(var, na.rm = TRUE) %>% sqrt() # assuming independent daily returns,
                                                     # the variance of the sum is the sum
                                                     # of the variances, and the volatility
                                                     # is the sqrt() of the variance  
    , .groups = "drop"
  ) %T>% 
  (\(x) print(x))
# A tibble: 6 × 3
   year return volatility
  <dbl>  <dbl>      <dbl>
1  2018 -0.120      0.121
2  2019  0.289      0.126
3  2020  0.163      0.347
4  2021  0.269      0.132
5  2022 -0.194      0.241
6  2023  0.147      0.107

Note that

\[ \begin{align*} \sum_{i=1}^{n}r_{i} & =\sum_{i=1}^{n}\log\frac{\text{Close/Last}_{t=i}}{\text{Close/Last}_{t=i-1}}\\ & =\log\prod_{i=1}^{n}\frac{\text{Close/Last}_{t=i}}{\text{Close/Last}_{t=i-1}}\\ & =\log\frac{\text{Close/Last}_{t=n}}{\text{Close/Last}_{t=n-1}}\times\cdots\times\frac{\text{Close/Last}_{t=2}}{\text{Close/Last}_{t=1}}\times\frac{\text{Close/Last}_{t=1}}{\text{Close/Last}_{t=0}}\\ & =\log\frac{\text{Close/Last}_{t=n}}{\text{Close/Last}_{t=0}} \end{align*} \] and \(\exp\left(\sum_{i=1}^{n}r_{i}\right)=\frac{\text{Close/Last}_{t=n}}{\text{Close/Last}_{t=0}}\), so \(\exp\left(\sum_{i=1}^{n}r_{i}\right)-1\) is the annual return. The value of 1 plus the annual return is sometimes called the total return.
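The telescoping identity is easy to verify numerically; a quick sketch with made-up closing prices:

```r
# Made-up closes, oldest to newest
close <- c(100, 103, 101, 107)

# Daily log returns r_i = log(close_i / close_{i-1})
r <- diff(log(close))

# exp(sum(r)) - 1 recovers the simple return over the whole period
total_return <- exp(sum(r)) - 1
all.equal(total_return, close[4] / close[1] - 1)   # TRUE
```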

Exercise 9

Take the table from the last exercise and use the gt:: package to format it. Add summary rows for the period return and period volatility (note that variances can be added; volatilities cannot- so you will need to do some calculations).

SOLUTION:
spx_return_tbl %>% 
  gt::gt('year') %>% 
  # form the columns as percents
  gt::fmt_percent(
    columns = c(return, volatility)
    , decimals=1                      # format to one decimal place
    , force_sign=TRUE                 # force the sign to be printed
  ) %>% 
  # add summary rows
  gt::grand_summary_rows(
    columns = return                  # summarize just the returns
    , fns = 
      list(
        id = "ret"
        , label="period return"       # the return over multiple years is the
                                      # product of the annual total returns
                                      # (1 + return), less 1
      ) ~ prod(1 + .) - 1
    , fmt = ~ gt::fmt_percent(., decimals = 1, force_sign=TRUE)
  ) %>% 
  gt::grand_summary_rows(
    columns = volatility
    , fns = list(`period volatility` = ~sqrt(sum(.*.)) )
    , fmt = ~ gt::fmt_percent(., decimals = 1, force_sign=TRUE)
  ) %>% 
  gtExtras::gt_theme_espn()
return volatility
2018 −12.0% +12.1%
2019 +28.9% +12.6%
2020 +16.3% +34.7%
2021 +26.9% +13.2%
2022 −19.4% +24.1%
2023 +14.7% +10.7%
period return +54.5%
period volatility +48.8%

This matches up with the data from macrotrends. What we’ve calculated is based on the stock price only and does not include dividends (so it is not the total return). Note that the data for 2018 and 2023 is incomplete in the dataset.

The return calculation here follows from the equations in the last exercise: we add one to each return to get back to a price ratio, multiply the ratios for the individual periods, and then subtract one to get back to a return.
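That chaining step can be sketched directly from the annual returns in the table (values rounded to three decimals, so the result only approximately matches the summary row):

```r
# Annual returns 2018-2023, rounded from the table above
annual_returns <- c(-0.120, 0.289, 0.163, 0.269, -0.194, 0.147)

# 1 + r turns each return into a price ratio; compounding the ratios
# and subtracting 1 gives the return over the whole period
period_return <- prod(1 + annual_returns) - 1
period_return   # close to the +54.5% shown in the gt summary row
```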

Note

Financial return math is not intuitive, so I will give full grades for the right structure of the code, even if the calculation is not quite right.

Resources for additional practice (optional)

  • Work/read through the TTC subway dataset example in Telling Stories with Data, Chapter 11.4: TTC Subway Delays